User-configurable OCR enhancement for online natural history archives
Internal identifier: 000F19 (Main/Exploration); previous: 000F18; next: 000F20
Authors: Andy Downton [United Kingdom]; Jingyu He [United Kingdom]; Simon Lucas [United Kingdom]
Source:
- International journal on document analysis and recognition (Print) [ISSN 1433-2833]; 2007.
French descriptors
- Pascal (Inist)
- Reconnaissance caractère, Reconnaissance optique caractère, Bibliothèque électronique, Traitement image document, Mot, Langage naturel, Base donnée, Zone donnée, Disponibilité, Internet, Analyse documentaire, Archive électronique, Système construction, Analyse texte, Musée, Réseau web, Validation, Dictionnaire électronique.
- Wicri :
- topic : Base de données, Musée.
English descriptors
- KwdEn :
- Availability, Character recognition, Construction system, Data field, Database, Digital archive, Document analysis, Document image processing, Electronic dictionary, Electronic library, Internet, Museum, Natural language, Optical character recognition, Text analysis, Validation, Word, World wide web.
Abstract
The creation of structured digital libraries from paper-based archives is an area of growing demand in many scientific and cultural fields, and is not satisfied either by off-the-shelf OCR or commercial form-processing systems. This paper describes and evaluates a configurable archive construction system, which integrates document image pre-processing and analysis with text post-processing tools and a standard OCR package to meet digital archiving requirements. The prototype system is currently being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera (Butterflies and Moths) and Coleoptera (Beetles) to searchable digital archives. Evaluation results covering different aspects of the system from card scanning to overall word recognition rates for different database fields are summarised for two datasets comprising over 5,000 cards selected from different parts of these archives. First-pass end-to-end word recognition rates of 70-90% are reported for key data fields, subject to availability of suitable electronic dictionaries. Further validation and correction is supported through web-editing of the online digital archive.
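The abstract ties first-pass word recognition rates to the availability of electronic dictionaries for each database field. As a rough illustration only, the sketch below snaps each OCR word to its closest entry in a field dictionary and then measures a word recognition rate against ground truth. It uses Python's standard `difflib` and is not the authors' method: the function names, the 0.8 similarity cutoff, the sample genus dictionary, and the sample OCR output are all assumptions made for this example.

```python
import difflib


def correct_field(ocr_words, dictionary, cutoff=0.8):
    """Replace each OCR word with its closest dictionary entry, if close enough.

    Hypothetical helper: words with no match above `cutoff` are left unchanged.
    """
    # Compare case-insensitively, but return the dictionary's own spelling.
    lookup = {entry.lower(): entry for entry in dictionary}
    corrected = []
    for word in ocr_words:
        matches = difflib.get_close_matches(
            word.lower(), list(lookup), n=1, cutoff=cutoff
        )
        corrected.append(lookup[matches[0]] if matches else word)
    return corrected


def word_recognition_rate(recognised, truth):
    """Fraction of ground-truth words reproduced exactly, position by position."""
    if not truth:
        return 0.0
    hits = sum(r == t for r, t in zip(recognised, truth))
    return hits / len(truth)


# Illustrative data only (assumed, not taken from the paper).
genus_dictionary = ["Papilio", "Pieris", "Vanessa"]
ocr_output = ["Papi1io", "Pieris", "Vancssa", "Xyz"]
ground_truth = ["Papilio", "Pieris", "Vanessa", "Colias"]

corrected = correct_field(ocr_output, genus_dictionary)
rate_before = word_recognition_rate(ocr_output, ground_truth)
rate_after = word_recognition_rate(corrected, ground_truth)
```

In this toy case the fourth word cannot be recovered because "Colias" is missing from the dictionary, which mirrors the abstract's caveat that results depend on a suitable dictionary being available for the field.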
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000320
- to stream PascalFrancis, to step Curation: 000466
- to stream PascalFrancis, to step Checkpoint: 000261
- to stream Main, to step Merge: 000F32
- to stream Main, to step Curation: 000F19
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">User-configurable OCR enhancement for online natural history archives</title>
<author><name sortKey="Downton, Andy" sort="Downton, Andy" uniqKey="Downton A" first="Andy" last="Downton">Andy Downton</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, C04 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, C04 3SQ</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Jingyu He" sort="Jingyu He" uniqKey="Jingyu He" last="Jingyu He">JINGYU HE</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, C04 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, C04 3SQ</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Lucas, Simon" sort="Lucas, Simon" uniqKey="Lucas S" first="Simon" last="Lucas">Simon Lucas</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, C04 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, C04 3SQ</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">07-0469293</idno>
<date when="2007">2007</date>
<idno type="stanalyst">PASCAL 07-0469293 INIST</idno>
<idno type="RBID">Pascal:07-0469293</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000320</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000466</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000261</idno>
<idno type="wicri:doubleKey">1433-2833:2007:Downton A:user:configurable:ocr</idno>
<idno type="wicri:Area/Main/Merge">000F32</idno>
<idno type="wicri:Area/Main/Curation">000F19</idno>
<idno type="wicri:Area/Main/Exploration">000F19</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">User-configurable OCR enhancement for online natural history archives</title>
<author><name sortKey="Downton, Andy" sort="Downton, Andy" uniqKey="Downton A" first="Andy" last="Downton">Andy Downton</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, C04 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, C04 3SQ</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Jingyu He" sort="Jingyu He" uniqKey="Jingyu He" last="Jingyu He">JINGYU HE</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, C04 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, C04 3SQ</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Lucas, Simon" sort="Lucas, Simon" uniqKey="Lucas S" first="Simon" last="Lucas">Simon Lucas</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, C04 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, C04 3SQ</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2007">2007</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Availability</term>
<term>Character recognition</term>
<term>Construction system</term>
<term>Data field</term>
<term>Database</term>
<term>Digital archive</term>
<term>Document analysis</term>
<term>Document image processing</term>
<term>Electronic dictionary</term>
<term>Electronic library</term>
<term>Internet</term>
<term>Museum</term>
<term>Natural language</term>
<term>Optical character recognition</term>
<term>Text analysis</term>
<term>Validation</term>
<term>Word</term>
<term>World wide web</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Bibliothèque électronique</term>
<term>Traitement image document</term>
<term>Mot</term>
<term>Langage naturel</term>
<term>Base donnée</term>
<term>Zone donnée</term>
<term>Disponibilité</term>
<term>Internet</term>
<term>Analyse documentaire</term>
<term>Archive électronique</term>
<term>Système construction</term>
<term>Analyse texte</term>
<term>Musée</term>
<term>Réseau web</term>
<term>Validation</term>
<term>Dictionnaire électronique</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Base de données</term>
<term>Musée</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">The creation of structured digital libraries from paper-based archives is an area of growing demand in many scientific and cultural fields, and is not satisfied either by off-the-shelf OCR or commercial form-processing systems. This paper describes and evaluates a configurable archive construction system, which integrates document image pre-processing and analysis with text post-processing tools and a standard OCR package to meet digital archiving requirements. The prototype system is currently being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera (Butterflies and Moths) and Coleoptera (Beetles) to searchable digital archives. Evaluation results covering different aspects of the system from card scanning to overall word recognition rates for different database fields are summarised for two datasets comprising over 5,000 cards selected from different parts of these archives. First-pass end-to-end word recognition rates of 70-90% are reported for key data fields, subject to availability of suitable electronic dictionaries. Further validation and correction is supported through web-editing of the online digital archive.</div>
</front>
</TEI>
<affiliations><list><country><li>Royaume-Uni</li>
</country>
</list>
<tree><country name="Royaume-Uni"><noRegion><name sortKey="Downton, Andy" sort="Downton, Andy" uniqKey="Downton A" first="Andy" last="Downton">Andy Downton</name>
</noRegion>
<name sortKey="Jingyu He" sort="Jingyu He" uniqKey="Jingyu He" last="Jingyu He">JINGYU HE</name>
<name sortKey="Lucas, Simon" sort="Lucas, Simon" uniqKey="Lucas S" first="Simon" last="Lucas">Simon Lucas</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F19 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000F19 | SxmlIndent | more
To link to this page from the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:07-0469293 |texte= User-configurable OCR enhancement for online natural history archives }}
This area was generated with Dilib version V0.6.32.